Whose questions can you answer, and which questions might interest you?

Outline

The content of this kernel will cover two parts.

  • Part 1: Finding the users who ask questions similar to those of a specific user.
  • Part 2: Finding the users who give answers similar to those of a specific user.

Both parts follow the same two-step process: NLP feature extraction, then KNN model fitting. The first part uses the text of the questions; the second part uses the text of the answers.
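The two-step process can be sketched on a toy corpus. This is a minimal illustration, not the kernel's final model: the user IDs and texts are made up, and the choice of cosine distance with a brute-force search is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus (hypothetical): one text per user.
user_ids = [101, 102, 103, 104]
user_texts = [
    "how to merge two data frames by column in R",
    "merge data frames on a common column",
    "plot a histogram with ggplot2",
    "change histogram bin width in ggplot2",
]

# Step 1 (NLP): turn each text into a TF-IDF vector.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(user_texts)

# Step 2 (KNN): index the vectors and query by cosine distance.
knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
knn.fit(features)

# Neighbours of the first user; the closest hit is the user itself.
distances, indices = knn.kneighbors(features[0])
nearest = [user_ids[i] for i in indices[0]]
print(nearest)
```

The first neighbour returned is always the query user, so in practice one would ask for k+1 neighbours and drop the first.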


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud,STOPWORDS

Questions=pd.read_csv('./Questions.csv',encoding = 'iso-8859-1')
Answers=pd.read_csv('./Answers.csv',encoding = 'iso-8859-1')

In [2]:
User_id_inQ= Questions['OwnerUserId'].unique()
User_id_inA= Answers['OwnerUserId'].unique()

In [3]:
All_id=set(User_id_inQ).intersection(User_id_inA)

In [4]:
print('So we have '+str(len(All_id))+ \
      ' users that post both questions and answers on Stack Overflow')


So we have 11194 users that post both questions and answers on Stack Overflow

In [5]:
users=pd.DataFrame({'idUser':list(All_id)})

In [6]:
users.head()


Out[6]:
idUser
0 1900545.0
1 1867780.0
2 4259841.0
3 950280.0
4 5313987.0

In [7]:
users['Quantity']=users['idUser'].apply(lambda x: \
                    len(Questions[Questions['OwnerUserId']==x]['Body']) \
                    +len(Answers[Answers['OwnerUserId']==x]['Body']))

In [8]:
users.head()


Out[8]:
idUser Quantity
0 1900545.0 15
1 1867780.0 3
2 4259841.0 2
3 950280.0 4
4 5313987.0 2
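The per-user lambda above rescans both tables once for every user, which is slow on large data. A possible vectorized alternative is sketched below on toy frames; the lowercase names and the sample data are hypothetical, and the `fillna(0)` guard is an added assumption for users missing from one table.

```python
import pandas as pd

# Toy stand-ins for the Questions and Answers frames (hypothetical data).
questions = pd.DataFrame({"OwnerUserId": [1.0, 1.0, 2.0, 3.0, None]})
answers = pd.DataFrame({"OwnerUserId": [1.0, 2.0, 2.0, 2.0, 4.0]})
users = pd.DataFrame({"idUser": [1.0, 2.0]})

# Count posts per user once; value_counts() ignores missing IDs by default.
q_counts = questions["OwnerUserId"].value_counts()
a_counts = answers["OwnerUserId"].value_counts()

# Look up both counts for each user and add them.
users["Quantity"] = (
    users["idUser"].map(q_counts).fillna(0)
    + users["idUser"].map(a_counts).fillna(0)
).astype(int)
print(users)
```

Each table is scanned once, so the cost no longer grows with the number of users times the number of posts.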

In [9]:
users_final=users.sort_values(by='Quantity',ascending=False).reset_index(drop=True)
users_final.head()


Out[9]:
idUser Quantity
0 1855677.0 4997
1 1270695.0 2643
2 2372064.0 2314
3 143305.0 2245
4 1838509.0 2201

In [10]:
users_final=users_final.iloc[0:10000,]
users_final.shape


Out[10]:
(10000, 2)

In [11]:
All_id=list(users_final['idUser'])

First, clean the body text of the questions and answers by stripping code snippets and HTML tags, so that only the prose of each post is used.


In [12]:
# remove the code snippets from the question bodies (regex=True is required in recent pandas)
body = Questions['Body'].str.replace(r'<code>[^<]+</code>',' ',regex=True)
# strip the remaining HTML tags and line breaks to keep only the prose
Questions['QuestionBody'] = body.str.replace(r"<[^>]+>|\n|\r", " ",regex=True)

In [13]:
Questions.head()


Out[13]:
Id OwnerUserId CreationDate Score Title Body QuestionBody
0 77434 14008.0 2008-09-16T21:40:29Z 134 How to access the last value in a vector? <p>Suppose I have a vector that is nested in a... Suppose I have a vector that is nested in a d...
1 79709 NaN 2008-09-17T03:39:16Z 1 Worse sin: side effects or passing massive obj... <p>I have a function inside a loop inside a fu... I have a function inside a loop inside a func...
2 95007 15842.0 2008-09-18T17:59:19Z 48 Explain the quantile() function in R <p>I've been mystified by the R quantile funct... I've been mystified by the R quantile functio...
3 103312 NaN 2008-09-19T16:09:26Z 4 How to test for the EOF flag in R? <p>How can I test for the <code>EOF</code> fla... How can I test for the flag in R? For e...
4 255697 1941213.0 2008-11-01T15:48:30Z 3 Is there an R package for learning a Dirichlet... <p>I'm looking for a an <code>R</code> package... I'm looking for a an package which can be u...

In [15]:
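# remove the code snippets from the answer bodies (regex=True is required in recent pandas)
body = Answers['Body'].str.replace(r'<code>[^<]+</code>',' ',regex=True)
# strip the remaining HTML tags and line breaks; the column keeps the name
# 'QuestionBody' so that both tables can be handled the same way
Answers['QuestionBody'] = body.str.replace(r"<[^>]+>|\n|\r", " ",regex=True)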
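Since the same two substitutions are applied to both tables, they could be wrapped in one reusable function. The sketch below uses the same two patterns; the name `clean_body` is hypothetical, and the final whitespace collapse is an addition that the cells above do not perform.

```python
import re

CODE_RE = re.compile(r"<code>[^<]+</code>")  # code snippets
TAG_RE = re.compile(r"<[^>]+>|\n|\r")        # remaining HTML tags and line breaks

def clean_body(html):
    """Return the prose of a post with code snippets and HTML tags removed."""
    text = CODE_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    # Collapsing runs of whitespace is an extra step, not part of the original cells.
    return " ".join(text.split())

sample = "<p>How can I test for the <code>EOF</code> flag in R?</p>"
cleaned = clean_body(sample)
print(cleaned)
```

Assuming every `Body` value is a string, this could be applied with something like `Questions['Body'].apply(clean_body)`.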

In [16]:
Answers.head()


Out[16]:
Id OwnerUserId CreationDate ParentId Score IsAcceptedAnswer Body QuestionBody
0 79741 3259.0 2008-09-17T03:43:22Z 79709 -1 False <p>It's tough to say definitively without know... It's tough to say definitively without knowin...
1 79768 6043.0 2008-09-17T03:48:29Z 79709 5 False <p>use variables in the outer function instead... use variables in the outer function instead o...
2 79779 8002.0 2008-09-17T03:49:36Z 79709 0 False <p>Third approach: inner function returns a re... Third approach: inner function returns a refe...
3 79788 NaN 2008-09-17T03:51:30Z 79709 3 False <p>It's not going to make much difference to m... It's not going to make much difference to mem...
4 79827 14257.0 2008-09-17T03:58:26Z 79709 1 False <p>I'm not sure I understand the question, but... I'm not sure I understand the question, but I...

In [17]:
Q_data=Questions[['OwnerUserId','QuestionBody']]
A_data=Answers[['OwnerUserId','QuestionBody']]
Question=Q_data[Q_data['OwnerUserId'].isin(All_id)]
Answer=A_data[A_data['OwnerUserId'].isin(All_id)]

In [25]:
Question.head()


Out[25]:
OwnerUserId QuestionBody
2 15842.0 I've been mystified by the R quantile functio...
6 37751.0 I know that R works most efficiently with vec...
7 37751.0 So earlier I answered my own question on thin...
9 12677.0 I have imported a time series with dates of t...
10 277.0 I have a CSV of file of data that I can load ...

In [26]:
Answer.head()


Out[26]:
OwnerUserId QuestionBody
6 15842.0 If you're looking for something as nice as Py...
7 1428.0 I use the function: The nice thing ...
11 23813.0 Combining lindelof's and Gregg Lind's ideas: ...
14 37751.0 Linprog, mentioned by Galwegian, focuses on l...
15 37751.0 Clearly I should have worked on this for anot...

In [29]:
Answer.shape


Out[29]:
(151183, 2)

In [32]:
Answer['QuestionBody'][6]


Out[32]:
u" If you're looking for something as nice as Python's x[-1] notation, I think you're out of luck.  The standard idiom is         but it's easy enough to write a function to do this:         This missing feature in R annoys me too!  "

In [35]:
type(Question.QuestionBody)


Out[35]:
pandas.core.series.Series

In [34]:
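from sklearn.feature_extraction.text import TfidfVectorizer
# use one vectorizer per corpus: refitting a single instance would overwrite
# the question vocabulary with the answer vocabulary
tfidf_Q=TfidfVectorizer()
tfidf_A=TfidfVectorizer()
Q_features=tfidf_Q.fit_transform(Question.QuestionBody)
A_features=tfidf_A.fit_transform(Answer.QuestionBody)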
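Each call to `fit_transform` learns a fresh vocabulary, so a vectorizer only remembers the last corpus it was fitted on. The sketch below, on made-up toy corpora, keeps one vectorizer per corpus and then reuses the question vectorizer to project new text into the same feature space; all names here are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical), standing in for the question and answer texts.
question_texts = ["how to read a csv file", "how to plot a histogram"]
answer_texts = ["use read.csv to load the file", "try the hist function"]

# One vectorizer per corpus, so each vocabulary stays available afterwards.
q_vectorizer = TfidfVectorizer()
a_vectorizer = TfidfVectorizer()
q_matrix = q_vectorizer.fit_transform(question_texts)
a_matrix = a_vectorizer.fit_transform(answer_texts)

# New text must be projected with the vectorizer fitted on the same corpus.
new_question = q_vectorizer.transform(["read a csv"])

print(q_matrix.shape, a_matrix.shape, new_question.shape)
```

The number of columns in each matrix equals the size of the corresponding vocabulary, which is why the question and answer matrices have different widths.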

In [36]:
type(Q_features)


Out[36]:
scipy.sparse.csr.csr_matrix

In [37]:
Q_features


Out[37]:
<74619x79620 sparse matrix of type '<type 'numpy.float64'>'
	with 4455120 stored elements in Compressed Sparse Row format>

In [38]:
A_features


Out[38]:
<151183x61379 sparse matrix of type '<type 'numpy.float64'>'
	with 5247469 stored elements in Compressed Sparse Row format>